NB: The worksheet has been developed and prepared by Maxim Romanov for the course “R for Historical Research” (U Vienna, Spring 2019).

1 Goals

2 Preliminaries

2.1 Data

We will use the following text files in this worksheet. Please download them and keep them close to your worksheet; since some of the files are quite large, it is best to download them before loading them into R:

In order to make loading these files a little bit easier, you can paste the path to where you placed these files into an isolated variable and then reuse it as follows:
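For example (a sketch; the folder and file names below are assumptions and should be replaced with your own):

```r
# assumption: the downloaded files live in a folder "files" next to the worksheet
pathToFiles <- "./files/"

# reuse the path variable when loading a file
d1861 <- read.delim(paste0(pathToFiles, "dispatch_1861.tsv"),
                    encoding = "UTF-8", header = TRUE, quote = "")
```

If you later move the files, you only need to change the value of `pathToFiles` once, rather than editing every loading statement.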

The first two files are articles from “The Daily Dispatch” for the years of 1861 and 1862. The newspaper was published in Richmond, VA — the capital of the Confederate States (the South) during the American Civil War (1861-1865). The last file is a script of the first episode of Star Wars :). In fact, for now, we only need one file of the Dispatch.

2.3 Functions in R (a refresher)

Functions are groups of related statements that perform a specific task; they help break a program into smaller, modular chunks. As programs grow larger and larger, functions keep them organized and manageable. Functions also help avoid repetition and make code reusable.

Most programming languages, R included, come with many pre-defined (built-in) functions. Essentially, all statements that take arguments in parentheses are functions. For instance, in the code chunk above, read.delim() is a function that takes as its arguments: 1) a filename (or path to a file); 2) an encoding; 3) whether the file has a header; and 4) whether to treat " as a special character. We can also write our own functions, which take care of sets of operations that we tend to repeat again and again.

Later, take a look at this video by one of the key R developers, and check this tutorial.

2.3.1 Simple Function Example: Hypotenuse

(From Wikipedia) In geometry, a hypotenuse is the longest side of a right-angled triangle, the side opposite the right angle. The length of the hypotenuse of a right triangle can be found using the Pythagorean theorem, which states that the square of the length of the hypotenuse equals the sum of the squares of the lengths of the other two sides (catheti). For example, if one of the other sides has a length of 3 (when squared, 9) and the other has a length of 4 (when squared, 16), then their squares add up to 25. The length of the hypotenuse is the square root of 25, that is, 5.

Let’s write a function that takes the lengths of the catheti as arguments and returns the length of the hypotenuse:
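A minimal version of such a function might look like this (a sketch; the original chunk also prints a formatted message, as the output below shows):

```r
# return the length of the hypotenuse, given the two catheti
hypotenuse <- function(a, b) {
  sqrt(a^2 + b^2)   # Pythagorean theorem
}

hypotenuse(3, 4)   # returns 5
```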

## [1] "In the triangle with catheti of length 390 and 456, the length of hypothenuse is 600.029999250037"
## [1] 600.03

2.3.2 A More Complex Example: Cleaning Text

Let’s say we want to clean up a text so that it is easier to analyze: 1) convert everything to lower case; 2) remove all non-alphanumeric characters; and 3) make sure that there are no multiple spaces:
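Such a function might look like this (a sketch; the function name is mine, and the input sentence is reconstructed from the output below):

```r
cleanText <- function(text) {
  text <- tolower(text)                    # 1) convert to lower case
  text <- gsub("[^[:alnum:]]", " ", text)  # 2) replace non-alphanumeric characters with spaces
  text <- gsub("\\s+", " ", text)          # 3) collapse multiple spaces into one
  text
}

cleanText("This is a sentence, with punctuation, which mentions Vienna - the capital of Austria!")
```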

## [1] "this is a sentence with punctuation which mentions vienna the capital of austria "

3 Texts and text analysis

We can think of text analysis as a means of extracting meaningful information from structured and unstructured texts. As historians, we often do that by reading texts and collecting relevant information: taking notes, writing index cards, summarizing texts, juxtaposing one text against another, comparing texts, looking into how specific words and terms are used, etc. Doing text analysis computationally, we do many similar things: we extract information of a specific kind, we compare texts, we look for similarities, we look for differences, etc.

While there are similarities between traditional and computational text analysis, there are, of course, also significant differences. One of them is procedural: in computational reading we must explicitly perform every step of our analysis. For example, when we read a sentence, we more or less automatically identify the meaningful words (subject, verb, object, etc.); we identify keywords; we parse every word, identifying what part of speech it is, what its lemma (i.e., its dictionary form) is, etc. By doing these steps we reconstruct the meaning of the text that we read, but we perform most of them almost unconsciously, especially if the text is written in our native tongue. In computational analysis, these steps must be performed explicitly (in order of growing complexity):

  1. Tokenization: what we see as a text made of words, the computer sees as a continuous string of characters (white spaces, punctuation and the like are also characters). We need to break such strings into discrete objects that the computer can construe as words.
  2. Lemmatization: reduces the variety of forms of the same word to its dictionary form. Another, somewhat similar procedure is called stemming, which usually means the removal of the most common suffixes and endings to get to the stem (or root) of the word.
  3. POS (part-of-speech) tagging: this is where we run an NLP tool that identifies the part of speech of each word in our text.
  4. Syntactic analysis: the most complicated procedure, also usually performed with an NLP tool, which analyzes syntactic relationships within each sentence, identifying its subject(s), verb(s), object(s), etc.

NOTE: NLP — natural language processing.

Some examples:

## Loading required package: koRpus.lang.en
## Loading required package: koRpus
## Loading required package: sylly
## For information on available language packages for 'koRpus', run
## 
##   available.koRpus.lang()
## 
## and see ?install.koRpus.lang()
## 
## Attaching package: 'koRpus'
## The following objects are masked from 'package:quanteda':
## 
##     tokens, types
## The following object is masked from 'package:readr':
## 
##     tokenize

The textstem library does lemmatization and stemming, but only for English. Tokenization can be performed with the str_split() function, and you can define how you want your string to be split.
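For example (a sketch assuming the stringr and textstem libraries; the sample sentences are reconstructed from the output below):

```r
library(stringr)
library(textstem)

sentences <- c("He tried to open one of the bigger boxes.",
               "The smaller boxes did not want to be opened.",
               "Different forms: open, opens, opened, opening, opened, opener, openers.")

str_split(sentences, "\\W+")   # tokenization: split on non-word characters
lemmatize_strings(sentences)   # reduce words to dictionary forms
stem_strings(sentences)        # reduce words to stems
```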

## [[1]]
##  [1] "He"     "tried"  "to"     "open"   "one"    "of"     "the"    "bigger"
##  [9] "boxes"  ""      
## 
## [[2]]
##  [1] "The"     "smaller" "boxes"   "did"     "not"     "want"    "to"     
##  [8] "be"      "opened"  ""       
## 
## [[3]]
##  [1] "Different" "forms"     "open"      "opens"     "opened"    "opening"  
##  [7] "opened"    "opener"    "openers"   ""
## [1] "He try to open one of the big box."                           
## [2] "The small box do not want to be open."                        
## [3] "Different form: open, open, open, open, open, opener, opener."
## [1] "He tri to open on of the bigger box."                  
## [2] "The smaller box did not want to be open."              
## [3] "Differ form: open, open, open, open, open, open, open."

Note: It is often important to ensure that all capital letters are converted into small letters or the other way around; additionally, some normalization procedures may be necessary to reduce orthographic complexities of specific languages (for example, ö > oe).

4 Word frequencies and Word clouds

Let’s load all issues of the Dispatch from 1862.

We can quickly check what types of articles are there in those issues.

We can create subsets of articles based on their types.
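Both steps might look like this (a sketch; the data frame and column names are assumptions, since the loading chunk is not shown, and the type value "orders" is hypothetical):

```r
# assumption: the issues were loaded into a data frame `d1862`
# with a column `type` holding the article type

# check what types of articles there are
table(d1862$type)

# create a subset that keeps only articles of one type
d1862_orders <- d1862[d1862$type == "orders", ]
```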

  1. Create subsets for other major types.

  2. Describe problems with the data set and how they can be fixed.

your answer goes here…

Now, let’s tidy them up: to work with this as a tidy dataset, we need to restructure it in the one-token-per-row format, which, as we saw earlier, is done with the unnest_tokens() function.
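With tidytext, this step might look like the following (a sketch; the data frame and column names are assumptions):

```r
library(dplyr)
library(tidytext)

# assumption: d1862 has a column `text` with the full text of each article
d1862_tidy <- d1862 %>%
  unnest_tokens(word, text)   # one token (word) per row
```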

Stop words are an important concept. In general, the notion refers to the most frequent words/tokens that one might want to exclude from analysis. There are existing lists of stop words that you can find online, and they work fine for testing purposes.

For research purposes, it is highly advisable to develop your own stop word lists. The process is very simple:

  1. create a frequency list of your tokens/words;
  2. arrange them by frequencies in descending order;
  3. save top 2-3,000 in a tsv/csv file;
  4. open in any table editor;
  5. add a new column and tag those words that you want to exclude: for example, 1 for words to exclude, 0 for words to keep. It is convenient to automatically fill the column with the default value (0), and then change only those words that you want to remove (1).

You will see that some words, despite their frequency, might be worth keeping. When you are done, you can load the list and use the anti_join() function to filter your corpus.
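The steps above can be sketched as follows (the file names and the `exclude` column name are my own assumptions):

```r
library(dplyr)
library(readr)

# 1-3) frequency list, sorted in descending order, top 3,000 saved to a TSV
freq_list <- d1862_tidy %>%        # assumption: one-token-per-row data frame with a column `word`
  count(word, sort = TRUE) %>%
  head(3000)
write_tsv(freq_list, "stopwords_candidates.tsv")

# 4-5) tag the words in a table editor, adding a column `exclude` (1 = exclude, 0 = keep)

# load the tagged list and remove the tagged words from the corpus
stop_words_custom <- read_tsv("stopwords_tagged.tsv") %>%
  filter(exclude == 1) %>%
  select(word)
d1862_filtered <- anti_join(d1862_tidy, stop_words_custom, by = "word")
```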

4.1 Word Frequencies

4.2 Wordclouds

Wordclouds can be an efficient way to visualize the most frequent words. Unfortunately, in most cases, wordclouds are used neither correctly nor efficiently. (Let’s check Google for some examples.)
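A basic wordcloud might be generated like this (a sketch using the wordcloud package; `d1862_tidy` is assumed to be a one-token-per-row data frame with a column `word`):

```r
library(dplyr)
library(wordcloud)

# build a frequency table, then plot the most frequent words
freqs <- d1862_tidy %>%
  count(word, sort = TRUE)

wordcloud(words = freqs$word, freq = freqs$n,
          max.words = 100, random.order = FALSE)
```

Setting `random.order = FALSE` plots the most frequent words in the center, which usually makes the cloud easier to read.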

  1. What can we glean from this wordcloud? Create a wordcloud for obituaries.
  2. Create a wordcloud for obituaries, but without stop words.
  3. Create a wordcloud for obituaries, but on lemmatized texts and without stop words.
  4. Summarize your observations below. What stands out in these different versions of wordclouds? Which of the wordclouds do you find most efficient? Can you think of some scenarios where a different type of wordcloud would be more efficient? Why?

your answer goes here…

For more details on generating word clouds in R, see: http://www.sthda.com/english/wiki/text-mining-and-word-cloud-fundamentals-in-r-5-simple-steps-you-should-know.

5 Word Distribution Plots

5.1 Simple — a Star Wars Example

This kind of plot works better with texts rather than with newspapers. Let’s take a look at a script of Episode I:

Try names of different characters (“shmi”, “padme”, “anakin”, “sebulba”), or other terms that you know are tied to a specific part of the movie (“pod”, “naboo”, “gungans”, “coruscant”).
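A simple dispersion plot of this kind can be sketched in base R (assuming `sw_words` is a character vector of all tokens of the script, in order; the variable name is mine):

```r
term <- "anakin"
positions <- which(sw_words == term)   # positions of the term in the script

# draw a tick mark at every position where the term occurs
plot(positions, rep(1, length(positions)), pch = "|",
     xlim = c(1, length(sw_words)), yaxt = "n",
     xlab = "position in the script", ylab = "", main = term)
```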

6 Word Distribution Plots: With Frequencies Over Time

For newspapers—and other diachronic corpora—a different approach will work better:

We can now build a graph of word occurrences over time. In the example below we search for manassas, which is the place where the Second Battle of Bull Run (or, the Second Battle of Manassas) took place on August 28-30, 1862. The battle ended in Confederate victory. Our graph shows the spike of mentions of Manassas in the first days of September — right after the battle took place.
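Such a graph might be built like this (a sketch; the data frame and column names are assumptions):

```r
library(dplyr)
library(ggplot2)

# assumption: d1862_tidy has a column `date` (of class Date) and a column `word`
manassas_freqs <- d1862_tidy %>%
  filter(word == "manassas") %>%
  count(date)

ggplot(manassas_freqs, aes(x = date, y = n)) +
  geom_col() +
  labs(x = "date of issue", y = "frequency of 'manassas'")
```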

Such graphs can be used to monitor discussions of different topics in chronological perspective.

  1. A graph like this can be used in a different way. Try the words “killed” and “deserters”. When do these words spike? Can you interpret these graphs?

your response goes here

7 KWIC: Keywords-in-Context

Keywords-in-context is the most common method for creating concordances — a view that allows us to go through all instances of specific words or word forms in order to understand how they are used. The quanteda library offers a very quick and easy application of this method:

Now, we can query the created corpus object using this command: kwic(YourCorpusObject, pattern = YourSearchPattern). pattern= can also take vectors (for example, c("soldier*", "troop*")); you can also search for phrases with pattern=phrase("fort donelson"); window= defines how many words will be shown before and after the match.
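For example (a sketch; the data frame and column names are assumptions, and depending on your quanteda version you may need the intermediate tokens() step shown here):

```r
library(quanteda)

# assumption: d1862 has a column `text` with the article texts
dispatch_corpus <- corpus(d1862$text)
dispatch_tokens <- tokens(dispatch_corpus)

kwic_test <- kwic(dispatch_tokens, pattern = "manassas", window = 5)
head(kwic_test)
```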

If you type View(kwic_test) in your console, an HTML table with all the results will be opened in your browser.

8 Homework